By the end of the lab, you will be able to …
The guiding principle for workflow.
A workflow of data analysis is a process for managing all aspects of data analysis.
Planning, documenting, and organizing your work; cleaning the data; creating, renaming, and verifying variables; performing and presenting statistical analyses; producing replicable results; and archiving what you have done are all integral parts of your workflow.
| Stage | Notes |
|---|---|
| Set up | Systematic organization of the project and project files. |
| Familiarize self with data | Skipping takes more time in the long run. |
| Process data | Takes the MOST time. |
| Running analyses | What people THINK takes the most time. |
| Presenting results | What people (wrongly) think does not take time. |
There are many file types, but these are key to an R & RStudio workflow (and likely new to you):
| Extension | Description |
|---|---|
| .Rproj | RStudio project file (keeps project settings). |
| .R | R scripts store a sequence of R commands (code) that can be run all at once or line by line. |
| .qmd | Quarto Markdown creates reproducible documents that contain a combination of text, code, and output. |
| .Rdata (or sometimes .rda) | These store and load R objects—like data frames. |
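As a minimal sketch of how an .Rdata file stores and restores R objects, the base R functions save() and load() can be used as below. The data frame `scores` and the temporary file location are made up for illustration and are not part of the lab materials.

```r
# A small made-up data frame standing in for real project data
scores <- data.frame(id = 1:3, score = c(90, 85, 77))

# Write the object to an .Rdata file (a temporary folder is used here)
rdata_path <- file.path(tempdir(), "scores.Rdata")
save(scores, file = rdata_path)

# Remove the object from memory, then restore it from disk
rm(scores)
load(rdata_path)

# The data frame is back under its original name
nrow(scores)
```

Note that load() restores objects under the names they had when saved, so it can silently overwrite an object of the same name in your session.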
should be:
Create an RStudio project for each data analysis project.
It supports an organized and reproducible workflow, cleanly separated from all other projects that you are working on. Everything you need in one place:
Adopting a project-based workflow avoids changing file paths.
ABSOLUTE FILE PATHS
Department of Sociology
Unit 17100, 17th Floor, Ontario Power Building
700 University Ave., Toronto, ON M5G 1Z5
C:\Users\Pepin\GitHub\SOC6302\scripts
RELATIVE FILE PATHS
Take the left side elevators to the 17th floor.
Go through the double doors and take a right.
First door on your left.
here("scripts")
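The street-address analogy can be sketched in base R. An absolute path is the full address, valid on one computer only; a relative path is a set of directions from wherever you currently are. The file name `clean-data.R` is a made-up example.

```r
# Absolute path: the full address, which only exists on one computer
absolute_path <- "C:/Users/Pepin/GitHub/SOC6302/scripts"

# Relative path: directions that start from the project folder
# ("clean-data.R" is a hypothetical file name)
relative_path <- file.path("scripts", "clean-data.R")
relative_path

# R resolves a relative path against the current working directory
file.path(getwd(), relative_path)
```

Inside an RStudio project, the here package builds such paths from the project root, so the same code keeps working when the project folder is moved or shared.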
Sit back and enjoy the show!
There are four key regions or “panes” in the interface:
Source pane: where you can edit and save R scripts or author computational documents like Quarto and R Markdown.
Console pane: where you write short interactive R commands.
Environment pane: displays temporary R objects created during that R session.
Output pane: displays the plots, tables, or HTML outputs of executed code along with files saved to disk.
Heads Up!
The top-left panel (source pane) can be launched by opening any editable file in RStudio.
Open RStudio, then click the dropdown arrow next to the “New File icon,” and then “R script” or “Quarto Document.”
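Start every RStudio session with a clean memory by turning off the automatic saving and restoring of your workspace (.RData file) when you quit RStudio. This is important for reproducibility, debugging, and avoiding littering your computer with unnecessary files.
Set this via: Tools > Global Options > General. Under Workspace, uncheck “Restore .RData into workspace at startup” and set “Save workspace to .RData on exit” to “Never.”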
CRAN is like an App Store for R. It hosts R packages, documentation, and source code contributed by users worldwide. It is moderated (packages must pass quality checks before they are accepted), which makes it generally reliable.
R users can easily install, update, and share R packages using install.packages().
R comes with basic tools, but packages extend the capabilities of base R (what you already installed). An R package is like a toolbox: a collection of functions, data, and documentation that help you do specific tasks using R.
You’ll install each package (only once per system):
You’ll load each package (every time you use it):
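A minimal sketch of the install-once, load-every-session pattern, using the here package as the example. The requireNamespace() check and the explicit CRAN mirror URL are conveniences added for this sketch; typing install.packages("here") directly in the Console works too.

```r
# Install once per system; the check skips the download if the package is already there
if (!requireNamespace("here", quietly = TRUE)) {
  install.packages("here", repos = "https://cloud.r-project.org")
}

# Load at the start of every R session in which you use the package
library(here)
```

Keeping library() calls at the top of a script or Quarto document makes it clear which packages the code depends on.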
Some help videos and further explanation:
Quarto is the tool you’ll use to create reproducible computational documents. Every assignment you hand in will be a Quarto document.
great for learning, exploring and tinkering.
rerun it without attention to formatting or markdown.
great for communicating analysis and results
combines narrative explanation with code output (results).
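In a Quarto document, R code lives in code chunks, and chunk options are written as `#|` comments at the top of the chunk. The sketch below shows only the body of such a chunk; because `#|` lines are ordinary comments to R, the same code also runs unchanged in the Console. The message text and the object `x` are made up for illustration.

```r
#| message: false

# With message: false, the note below is left out of the rendered document
message("Shown when run interactively, hidden in the rendered document.")

# Regular output, such as this value, still appears in the rendered document
x <- 1 + 1
x
```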
Sit back and enjoy the show!
message: false hides any messages emitted by the code in your rendered document.
To create a new project in RStudio, click: File > New Project.
In the New Project wizard that pops up, select: New Directory, then New Project.
Name the project “SOC6302” and click: Create Project.
This will launch you into a new RStudio Project inside a new folder called “SOC6302”.
Download and open code-along-01.qmd
We’ll use the following packages:
here() (relative file paths)
tidyverse() (data wrangling)
gssr() (U.S. General Social Survey data)
gssrdoc() (GSS documentation)
here() and tidyverse()
Let’s install the two packages that are available on CRAN.
Copy and paste the following code into your Console pane. Then hit enter.
Then, do the same to install the tidyverse package.
gssr() and gssrdoc()
Heads Up!
R ignores text after #. These comments describe syntax.
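The gssr and gssrdoc packages are not on CRAN. One way to install them is sketched below, under the assumption that both packages are still served from the package author's r-universe repository (check the gssr README if this fails). The lines starting with # are comments that describe each step.

```r
# Assumption: gssr and gssrdoc are distributed via this r-universe repository;
# the CRAN mirror is listed second so that their dependencies can be found
gss_repos <- c("https://kjhealy.r-universe.dev", "https://cloud.r-project.org")

# Install each package only if it is missing, so rerunning this is harmless
for (pkg in c("gssr", "gssrdoc")) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg, repos = gss_repos)
  }
}
```

The download is large because the package bundles the survey data, so the first install may take a few minutes.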
R version 4.5.1 (2025-06-13 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 26100)
Matrix products: default
LAPACK version 3.12.1
locale:
[1] LC_COLLATE=English_Canada.utf8 LC_CTYPE=English_Canada.utf8
[3] LC_MONETARY=English_Canada.utf8 LC_NUMERIC=C
[5] LC_TIME=English_Canada.utf8
time zone: America/Toronto
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] gssrdoc_0.7.0 here_1.0.1 conflicted_1.2.0 summarytools_1.1.4
[5] flextable_0.9.9 kableExtra_1.4.0 labelled_2.14.1 haven_2.5.5
[9] gssr_0.7 lubridate_1.9.4 forcats_1.0.0 stringr_1.5.1
[13] dplyr_1.1.4 purrr_1.1.0 readr_2.1.5 tidyr_1.3.1
[17] tibble_3.3.0 ggplot2_3.5.2 tidyverse_2.0.0
loaded via a namespace (and not attached):
[1] gtable_0.3.6 xfun_0.52 tzdb_0.5.0
[4] vctrs_0.6.5 tools_4.5.1 generics_0.1.4
[7] curl_6.4.0 pacman_0.5.1 pkgconfig_2.0.3
[10] checkmate_2.3.2 data.table_1.17.8 pryr_0.1.6
[13] RColorBrewer_1.1-3 uuid_1.2-1 lifecycle_1.0.4
[16] compiler_4.5.1 farver_2.1.2 rapportools_1.2
[19] textshaping_1.0.1 codetools_0.2-20 fontquiver_0.2.1
[22] fontLiberation_0.1.0 htmltools_0.5.8.1 yaml_2.3.10
[25] pillar_1.11.0 MASS_7.3-65 openssl_2.3.3
[28] cachem_1.1.0 magick_2.8.7 fontBitstreamVera_0.1.1
[31] tidyselect_1.2.1 zip_2.3.3 digest_0.6.37
[34] stringi_1.8.7 reshape2_1.4.4 pander_0.6.6
[37] rprojroot_2.1.0 fastmap_1.2.0 grid_4.5.1
[40] cli_3.6.5 magrittr_2.0.3 base64enc_0.1-3
[43] withr_3.0.2 backports_1.5.0 gdtools_0.4.2
[46] scales_1.4.0 timechange_0.3.0 rmarkdown_2.29
[49] officer_0.6.10 matrixStats_1.5.0 askpass_1.2.1
[52] ragg_1.4.0 hms_1.1.3 memoise_2.0.1
[55] evaluate_1.0.4 knitr_1.50 tcltk_4.5.1
[58] viridisLite_0.4.2 rlang_1.1.6 Rcpp_1.1.0
[61] glue_1.8.0 xml2_1.3.8 svglite_2.2.1
[64] rstudioapi_0.17.1 jsonlite_2.0.0 plyr_1.8.9
[67] R6_2.6.1 systemfonts_1.2.3 fs_1.6.6
Let’s set up your project structure using the here() package.
here()
First, let’s establish our project directory.
Next, we’ll create folders within our project.
Research Projects
SOC6302
using here() and dir.create()
Report a list of the folders and/or files in the project folder and its sub-folders.
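The steps above can be sketched as follows, assuming the here package is installed and the SOC6302 project is open. The folder names "code-along" and "scripts" appear in this lab; "data" is an illustrative assumption, so adapt the list to your own project.

```r
library(here)

# Confirm which folder here() treats as the project root
here()

# "code-along" and "scripts" come from the lab; "data" is an assumed example
folders <- c("code-along", "data", "scripts")

for (f in folders) {
  # showWarnings = FALSE keeps reruns quiet when a folder already exists
  dir.create(here(f), showWarnings = FALSE)
}

# Report the folders at the top level of the project
list.dirs(here(), full.names = FALSE, recursive = FALSE)

# Report every file in the project, including those in sub-folders
list.files(here(), recursive = TRUE)
```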
Save this code-along in your newly created “code-along” sub-folder.
There’s no command in the R console to save scripts or Quarto files; use the editor’s File > Save As or Ctrl+S.
We’re going to use data from the U.S. General Social Survey (GSS).
# Load the data (will appear in your Global Environment pane)
data(gss_all)
# Preview the data frame, which is automatically named gss_all
gss_all
# A tibble: 75,699 × 6,867
year id wrkstat hrs1 hrs2 evwork occ prestige
<dbl+lbl> <dbl> <dbl+lbl> <dbl+lbl> <dbl+lbl> <dbl+lbl> <dbl> <dbl+lb>
1 1972 1 1 [workin… NA(i) [iap] NA(i) [iap] NA(i) [iap] 205 50
2 1972 2 5 [retire… NA(i) [iap] NA(i) [iap] 1 [yes] 441 45
3 1972 3 2 [workin… NA(i) [iap] NA(i) [iap] NA(i) [iap] 270 44
4 1972 4 1 [workin… NA(i) [iap] NA(i) [iap] NA(i) [iap] 1 57
5 1972 5 7 [keepin… NA(i) [iap] NA(i) [iap] 1 [yes] 385 40
6 1972 6 1 [workin… NA(i) [iap] NA(i) [iap] NA(i) [iap] 281 49
7 1972 7 1 [workin… NA(i) [iap] NA(i) [iap] NA(i) [iap] 522 41
8 1972 8 1 [workin… NA(i) [iap] NA(i) [iap] NA(i) [iap] 314 36
9 1972 9 2 [workin… NA(i) [iap] NA(i) [iap] NA(i) [iap] 912 26
10 1972 10 1 [workin… NA(i) [iap] NA(i) [iap] NA(i) [iap] 984 18
# ℹ 75,689 more rows
# ℹ 6,859 more variables: wrkslf <dbl+lbl>, wrkgovt <dbl+lbl>,
# commute <dbl+lbl>, industry <dbl+lbl>, occ80 <dbl+lbl>, prestg80 <dbl+lbl>,
# indus80 <dbl+lbl>, indus07 <dbl+lbl>, occonet <dbl+lbl>, found <dbl+lbl>,
# occ10 <dbl+lbl>, occindv <dbl+lbl>, occstatus <dbl+lbl>, occtag <dbl+lbl>,
# prestg10 <dbl+lbl>, prestg105plus <dbl+lbl>, indus10 <dbl+lbl>,
# indstatus <dbl+lbl>, indtag <dbl+lbl>, marital <dbl+lbl>, …
# Get the data only for the 2024 survey respondents
gss24 <- gss_get_yr(2024)
# look at the first 6 rows of the dataframe
head(gss24)
# A tibble: 6 × 639
year id wrkstat hrs1 hrs2 evwork marital martype
<dbl+lb> <dbl> <dbl+l> <dbl+lbl> <dbl+lbl> <dbl+lbl> <dbl+l> <dbl+lbl>
1 2024 1 1 [wor… 43 NA(i) [iap] NA(i) [iap] 5 [nev… NA(i) [iap]
2 2024 2 5 [ret… NA(i) [iap] NA(i) [iap] 1 [yes] 5 [nev… NA(i) [iap]
3 2024 3 5 [ret… NA(i) [iap] NA(i) [iap] 1 [yes] 1 [mar… 1 [mar…
4 2024 4 2 [wor… 20 NA(i) [iap] NA(i) [iap] 5 [nev… NA(i) [iap]
5 2024 5 5 [ret… NA(i) [iap] NA(i) [iap] 1 [yes] 3 [div… NA(i) [iap]
6 2024 6 4 [une… NA(i) [iap] NA(i) [iap] NA(i) [iap] 1 [mar… 1 [mar…
# ℹ 631 more variables: divorce <dbl+lbl>, widowed <dbl+lbl>,
# spwrksta <dbl+lbl>, sphrs1 <dbl+lbl>, sphrs2 <dbl+lbl>, spevwork <dbl+lbl>,
# cowrksta <dbl+lbl>, coevwork <dbl+lbl>, cohrs1 <dbl+lbl>, cohrs2 <dbl+lbl>,
# sibs <dbl+lbl>, childs <dbl+lbl>, age <dbl+lbl>, educ <dbl+lbl>,
# speduc <dbl+lbl>, coeduc <dbl+lbl>, codeg <dbl+lbl>, degree <dbl+lbl>,
# padeg <dbl+lbl>, madeg <dbl+lbl>, spdeg <dbl+lbl>, sex <dbl+lbl>,
# race <dbl+lbl>, res16 <dbl+lbl>, reg16 <dbl+lbl>, mobile16 <dbl+lbl>, …
With your mouse, go to the Environment pane (upper-right) and click on the “gss24” object. It opens in a new tab where you can browse through it.
This is often a good idea to get a first feel for the data, but only if your dataset is relatively small.
The GSS documentation is available online in .pdf form.
The .pdfs will be useful for general overviews.
For specific variable information, it will be helpful to use the documentation you’ll load into RStudio.
To see the variables available in the dataset, use the names() command.
Heads Up!
This command is best to use with smaller datasets.
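A minimal sketch of names(), using a tiny made-up data frame so that it runs without the GSS data. The object `toy` and its values are hypothetical; with the lab data, the equivalent call is names(gss24).

```r
# Stand-in data frame with three columns (values are made up)
toy <- data.frame(year = c(2024, 2024), id = 1:2, wrkstat = c(1, 5))

# List the variable (column) names
names(toy)

# Count the variables, handy before printing hundreds of names
length(names(toy))
```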
For information about a specific GSS variable,
type ?varname at the console.
In the output pane, the Help tab will show the variable documentation.
Heads Up!
Replace “varname” with the name of a variable.
Example: ?meovrwrk
meovrwrk {gssrdoc} R Documentation
Men hurt family when focus on work too much
Description
meovrwrk
Details
Question 1297. And, do you agree or disagree: c. Family life often suffers because men concentrate too much on their work.
Overview
For further details see the official GSS documentation.
Counts by year:
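| year | iap | agree | can't choose | disagree | neither agree nor disagree | no answer | strongly agree | strongly disagree | skipped on web | Total |
|---|---|---|---|---|---|---|---|---|---|---|
| 1972 | 1613 | - | - | - | - | - | - | - | - | 1613 |
| 1973 | 1504 | - | - | - | - | - | - | - | - | 1504 |
| 1974 | 1484 | - | - | - | - | - | - | - | - | 1484 |
| 1975 | 1490 | - | - | - | - | - | - | - | - | 1490 |
| 1976 | 1499 | - | - | - | - | - | - | - | - | 1499 |
| 1977 | 1530 | - | - | - | - | - | - | - | - | 1530 |
| 1978 | 1532 | - | - | - | - | - | - | - | - | 1532 |
| 1980 | 1468 | - | - | - | - | - | - | - | - | 1468 |
| 1982 | 1860 | - | - | - | - | - | - | - | - | 1860 |
| 1983 | 1599 | - | - | - | - | - | - | - | - | 1599 |
| 1984 | 1473 | - | - | - | - | - | - | - | - | 1473 |
| 1985 | 1534 | - | - | - | - | - | - | - | - | 1534 |
| 1986 | 1470 | - | - | - | - | - | - | - | - | 1470 |
| 1987 | 1819 | - | - | - | - | - | - | - | - | 1819 |
| 1988 | 1481 | - | - | - | - | - | - | - | - | 1481 |
| 1989 | 1537 | - | - | - | - | - | - | - | - | 1537 |
| 1990 | 1372 | - | - | - | - | - | - | - | - | 1372 |
| 1991 | 1517 | - | - | - | - | - | - | - | - | 1517 |
| 1993 | 1606 | - | - | - | - | - | - | - | - | 1606 |
| 1994 | 1545 | 695 | 33 | 243 | 286 | 27 | 122 | 41 | - | 2992 |
| 1996 | 1444 | 825 | 16 | 198 | 169 | 1 | 230 | 21 | - | 2904 |
| 1998 | 2832 | - | - | - | - | - | - | - | - | 2832 |
| 2000 | 940 | 877 | 43 | 361 | 331 | 22 | 209 | 34 | - | 2817 |
| 2002 | 1857 | 415 | 6 | 264 | 108 | - | 99 | 16 | - | 2765 |
| 2004 | 1906 | 460 | 4 | 188 | 135 | - | 94 | 25 | - | 2812 |
| 2006 | 2518 | 945 | 14 | 477 | 304 | 1 | 208 | 43 | - | 4510 |
| 2008 | 694 | 653 | 12 | 310 | 161 | - | 143 | 50 | - | 2023 |
| 2010 | 614 | 662 | 6 | 388 | 192 | 3 | 122 | 57 | - | 2044 |
| 2012 | 672 | 558 | 11 | 382 | 170 | - | 130 | 51 | - | 1974 |
| 2014 | 863 | 702 | 7 | 479 | 234 | 1 | 176 | 76 | - | 2538 |
| 2016 | 979 | 819 | 9 | 536 | 257 | - | 171 | 96 | - | 2867 |
| 2018 | 789 | 644 | 11 | 475 | 220 | 2 | 134 | 73 | - | 2348 |
| 2021 | 1315 | 886 | 1 | 487 | 1001 | - | 202 | 138 | 2 | 4032 |
| 2022 | 1168 | 885 | 15 | 537 | 618 | 1 | 201 | 117 | 2 | 3544 |
| 2024 | 1126 | 787 | 19 | 481 | 611 | - | 195 | 89 | 1 | 3309 |
| Total | 50650 | 10813 | 207 | 5806 | 4797 | 58 | 2436 | 927 | 5 | 75699 |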
Values
1 strongly agree
2 agree
3 neither agree nor disagree
4 disagree
5 strongly disagree
NA(d) can't choose
NA(i) iap
NA(j) I don't have a job
NA(m) dk, na, iap
NA(n) no answer
NA(p) not imputable
NA(r) refused
NA(s) skipped on web
NA(u) uncodeable
NA(x) not available in this release
NA(y) not available in this year
NA(z) see codebook
Source
General Social Survey https://gss.norc.org
[Package gssrdoc version 0.7.0 Index]
We can find which years one or more variables were asked with the gss_which_years() function.
# A tibble: 35 × 2
year meovrwrk
<dbl+lbl> <lgl>
1 1972 FALSE
2 1973 FALSE
3 1974 FALSE
4 1975 FALSE
5 1976 FALSE
6 1977 FALSE
7 1978 FALSE
8 1980 FALSE
9 1982 FALSE
10 1983 FALSE
# ℹ 25 more rows
Heads Up!
If run in the console, to see all rows, wrap the code in the print() command: print(gss_which_years(gss_all, meovrwrk), n = 40)
You can access the variables (i.e., columns) using the $ operator, as shown using the table() function.
The variable names are case sensitive. In this dataset, all variables are lowercase.
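A minimal sketch of the $ operator with table(), again using a made-up stand-in so that it runs without the GSS data. The object `toy` and its values are hypothetical; with the lab data, the equivalent call is table(gss_all$meovrwrk).

```r
# Stand-in data: six hypothetical responses, one of them missing
toy <- data.frame(meovrwrk = c(1, 2, 2, 4, NA, 2))

# Pull out one column with $ and count how often each value occurs
# (table() leaves out missing values by default)
counts <- table(toy$meovrwrk)
counts

# Variable names are case sensitive: the uppercase name matches no column
is.null(toy$MEOVRWRK)
```

Because a misspelled or wrongly capitalized name returns NULL rather than an error, an unexpectedly empty table is often a sign of a typo in the variable name.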
2,436 respondents were coded as 1 on this variable. What does that mean?
Let’s look at only the 2024 respondents.
Change the code to show just the gss24 respondents.
Then, add text to your code-along that interprets the results for the 2 value.
Finally, let’s render your code-along-01 and see the results!
Heads Up!
Click the down arrow next to render to choose whether to preview within RStudio’s Viewer Pane or in your browser.